Oğuzhan Ercan
x.com/oguzhannercan
Diffusion Models
Topics
- Diffusion Models
- Diffusion Architectures
- Swap-Inpainting Models
- Output Control Techniques
- Inference Time Optimization
- Quality Enhancement
- Video Generation
- Face Models
Prerequisites
Probability
Statistics
Linear Algebra
Calculus
Deep Learning
Differential Equations
Diffusion Models
Diffusion models consist of two interconnected processes: forward and backward. The forward diffusion process gradually corrupts the data by
interpolating between a sampled data point x_0 and Gaussian noise ε: x_t = α_t x_0 + σ_t ε.
(The information here is mostly taken from the Imagine Flash paper: https://arxiv.org/pdf/2405.05224)
Here α_t and σ_t define the signal-to-noise ratio (SNR) of the stochastic interpolant x_t. In the following, we opt for coefficients (α_t, σ_t)
that result in a variance-preserving process. Viewed in the continuous-time limit, the forward process above can be expressed as the stochastic
differential equation dx_t = f(x_t, t) dt + g(t) dw_t,
where f(x, t): R^d → R^d is a vector-valued drift coefficient, g(t): R → R is the diffusion coefficient, and w_t denotes Brownian motion at time t.
Inversely, the backward diffusion process is intended to undo the noising process and generate samples. According to Anderson's theorem, the
forward SDE above satisfies a reverse-time diffusion equation, which can be reformulated using the Fokker–Planck equations to obtain a
deterministic counterpart with equivalent marginal probability densities, known as the probability flow ODE:
dx_t = [f(x_t, t) − ½ g(t)² ∇_x log p_t(x_t)] dt.
The score term in this ODE is usually estimated by a time-conditioned neural network ε_Θ. Given these estimates, we can sample using
an iterative numerical solver f. First-order solvers like DDIM take the update x_{t−1} = α_{t−1} x̂_0 + σ_{t−1} ε_Θ(x_t, t),
where the sample data estimate x̂_0 at time step t is computed as x̂_0 = (x_t − σ_t ε_Θ(x_t, t)) / α_t.
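The DDIM update above can be sketched in code. This is a minimal, framework-agnostic version; the helper names `eps_model`, `alpha`, and `sigma` are illustrative, not from any specific library:

```python
def ddim_step(eps_model, x_t, t, t_prev, alpha, sigma):
    """One deterministic DDIM update. `eps_model(x, t)` predicts the noise
    component at time t; `alpha`/`sigma` map time steps to the interpolation
    coefficients (hypothetical names)."""
    eps = eps_model(x_t, t)
    # Invert the interpolant x_t = alpha_t * x0 + sigma_t * eps for x0
    x0_hat = (x_t - sigma[t] * eps) / alpha[t]
    # Deterministically re-noise the estimate to the earlier time step
    return alpha[t_prev] * x0_hat + sigma[t_prev] * eps
```

Iterating this step from t = T down to t = 0 over a chosen schedule yields a sample; the same function works on scalars, NumPy arrays, or torch tensors.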
Diffusion Architectures
The following slides include:
- High-Resolution Image Synthesis with Latent Diffusion Models 13 Apr 2022
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack 27 September 2023
- Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding 14 May 2024
- Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers 9 May 2024
- PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis 29 Dec 2023
High-Resolution Image Synthesis with Latent Diffusion Models 13 Apr
2022
https://arxiv.org/pdf/2112.10752
Emu: Enhancing Image Generation Models Using Photogenic Needles in
a Haystack 27 September 2023 (Image quality should always be prioritized over quantity.)
Their key insight is that supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve
generation quality. As with LLMs, effective fine-tuning can be achieved with a relatively small but high-quality dataset, e.g. 27K prompts.
They increase the channel count of the AE from 4 to a higher dimension. They use an additional adversarial loss for reconstruction, and they apply a
non-learnable preprocessing step to RGB images, a Fourier feature transform, to lift the input channel dimension.
They use a large U-Net with 2.8B trainable parameters, increasing the channel size and the number of stacked residual blocks in each stage for
larger model capacity. They use text embeddings from both CLIP ViT-L and T5-XXL as the text conditions.
They pre-train the model on 1.1B images, with progressively increasing resolutions during training; this approach improves finer details at
higher resolutions.
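The Fourier-feature channel lift can be sketched as follows. The exact frequencies and channel counts Emu uses are not specified here, so this is only an illustration of the idea of a non-learnable transform that expands the input channels:

```python
import numpy as np

def fourier_lift(x, num_freqs=4):
    """Non-learnable Fourier-feature lift of an image tensor.
    Maps (C, H, W) -> (C * (1 + 2 * num_freqs), H, W) by concatenating
    the raw channels with sin/cos features at geometrically spaced
    frequencies (illustrative choice, not Emu's exact parameters)."""
    feats = [x]
    for k in range(num_freqs):
        freq = (2.0 ** k) * np.pi
        feats.append(np.sin(freq * x))
        feats.append(np.cos(freq * x))
    return np.concatenate(feats, axis=0)
```

Because the transform is fixed, it adds no trainable parameters; the AE's first convolution simply consumes the lifted channels.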
https://arxiv.org/pdf/2309.15807
Hunyuan-DiT : A Powerful Multi-Resolution Diffusion
Transformer with Fine-Grained Chinese Understanding 14 May 2024
https://arxiv.org/pdf/2405.08748
Lumina-T2X: Transforming Text into Any Modality, Resolution, and
Duration via Flow-based Large Diffusion Transformers 9 May 2024
https://arxiv.org/pdf/2405.05945
PixArt-α: Fast Training of Diffusion Transformer for
Photorealistic Text-to-Image Synthesis 29 Dec 2023
https://arxiv.org/pdf/2310.00426
Swap-Inpainting Models
The following slides include:
- SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing 6 May 2024
SwapAnything: Enabling Arbitrary Object Swapping in Personalized
Visual Editing 6 May 2024
https://arxiv.org/pdf/2404.05717
Output Control Techniques
The following slides include:
- Adding Conditional Control to Text-to-Image Diffusion Models 26 Nov 2023
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback 11 Apr 2024
- CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models 13 May 2024
Adding Conditional Control to Text-to-Image Diffusion Models
26 Nov 2023
https://arxiv.org/pdf/2302.05543
ControlNet++: Improving Conditional Controls
with Efficient Consistency Feedback 11 Apr 2024
https://arxiv.org/pdf/2404.07987
CTRLorALTer: Conditional LoRAdapter for Efficient
0-Shot Control & Altering of T2I Models 13 May 2024
https://arxiv.org/pdf/2405.07913
Inference Time Optimization
The following slides include:
- SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions 25 Mar 2024
- Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation 18 Apr 2024
- PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator 13 May 2024
- Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation (Stability AI Turbo Solution) 18 March 2024
- Distilling Diffusion Models into Conditional GANs 9 May 2024
- Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models 3 April 2024
- EdgeFusion: On-Device Text-to-Image Generation 18 April 2024
SDXS: Real-Time One-Step Latent Diffusion Models
with Image Conditions 25 Mar 2024
They introduce a dual approach involving model miniaturization and a reduction in sampling steps. The methodology leverages knowledge distillation
to streamline the U-Net and image decoder architectures, and introduces a one-step DM training technique that utilizes feature
matching and score distillation.
VAE Decoder Optimization: Utilizing a pretrained diffusion model F to sample latent codes z and a pretrained VAE decoder to reconstruct images x,
they introduce a VAE Distillation (VD) loss for training a tiny image decoder G. They build G with only CNN blocks to eliminate complex
components like attention and normalization (I do not know why they consider norm layers computationally overwhelming.)
U-Net Optimization: They selectively remove residual and Transformer blocks from the U-Net, aiming to train a more compact model that still
reproduces the original model's intermediate feature maps and outputs effectively. Initializing noises and sampling images with an ODE to get
noise-image pairs results in low-quality images, so they use Rectified Flow, which tackles this challenge by straightening the sampling
trajectories. MSE loss makes the model tend to output the average of multiple feasible solutions, so they use SSIM instead. They also straighten
the model's trajectories to narrow the range of feasible outputs using existing fine-tuning methods like LCM.
One-Step Training: They first train the model for feature matching with an SSIM loss as warm-up; at this stage they sample noise-image pairs.
As ODE trajectories (for example DDPM's) are not straight, they use LCM-LoRA to rectify the flow. They note that warm-up training results are
good in image quality but do not capture the data distribution, so they add score distillation sampling with a learned manifold corrective.
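The SSIM objective they prefer over MSE can be sketched as a global (single-window) SSIM. Real implementations compute SSIM over local Gaussian windows, but the structure of the loss is the same; this simplified version is only illustrative:

```python
import numpy as np

def ssim_loss(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global (single-window) SSIM loss between two images in [0, 1].
    Returns 1 - SSIM so that lower is better. Simplified sketch:
    production code uses local windows, not whole-image statistics."""
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    )
    return float(1.0 - ssim)
```

Because SSIM compares luminance, contrast, and structure rather than per-pixel error, it penalizes the blurry "average of feasible solutions" failure mode that plain MSE encourages.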
https://arxiv.org/pdf/2403.16627
Imagine Flash: Accelerating Emu Diffusion Models with Backward
Distillation 18 Apr 2024
To achieve better quality at low step counts, they propose to distill along the student's backward path instead of the forward path. Put differently,
rather than having the student mimic the teacher, they use the teacher to improve the student based on its current state of knowledge. They propose
a Shifted Reconstruction Loss that dynamically adapts the knowledge transfer from the teacher model. Specifically, the loss is designed to distill
global, structural information from the teacher at high time steps, while focusing on rendering fine-grained details and high-frequency components
at lower time steps. They also propose noise correction, a training-free inference-time modification that enhances sample quality.
The coefficients are commonly chosen such that x_T is not pure noise during training, but rather contains low-frequency information leaked from x_0:
with x_t = α_t x_0 + σ_t x_T, any stochastic interpolant x_t, t < T still contains information from the ground-truth sample via the first summand α_t x_0.
Backward distillation eliminates this information leakage at all time steps t, preventing the model from relying on a ground-truth signal. This is achieved
by simulating the inference process during training, which can also be interpreted as calibrating the student on its own upstream backward path.
They first perform backward iterations of the student model to obtain the intermediate latent, then use this latent as input for
both the student and teacher models during training.
For the distillation loss, they define the shifted reconstruction loss such that for higher values of t, the target produced by the teacher
model displays global content similarity with the student output but with improved semantic text alignment, and for lower values of t, the target
image features enhanced fine-grained details while maintaining the same overall structure as the student's prediction.
At t = T the input is pure noise, and predicting the noise at that time step is not informative. Existing works propose predicting the velocity, i.e.
the rate of change, but converting a model to velocity prediction requires extra training effort. They present a training-free method: by
treating t = T as a unique case and replacing ε_Θ with the true noise x_T, the update f is corrected.
https://arxiv.org/pdf/2405.05224
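Backward distillation as described above can be sketched as follows. `student_step` stands for any hypothetical one-step denoiser update (e.g. a DDIM step); this is an illustration of the idea, not the authors' code:

```python
def backward_distill_inputs(student_step, x_T, timesteps, t_target):
    """Backward distillation (Imagine Flash idea): instead of noising
    ground-truth data, run the *student's own* backward path from pure
    noise x_T down to the target step, and train on that latent.

    `timesteps` is a descending list of (t, t_prev) pairs;
    `student_step(x, t, t_prev)` is a hypothetical one-step update."""
    x = x_T
    for t, t_prev in timesteps:
        if t <= t_target:
            break  # stop once we reach the step we want to train at
        x = student_step(x, t, t_prev)
    return x  # fed to both the student and the teacher during training
```

Because the training input is produced by the student itself, no ground-truth low-frequency signal can leak into it, which is the point of distilling along the backward path.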
PeRFlow: Piecewise Rectified Flow as Universal
Plug-and-Play Accelerator 13 May 2024
https://arxiv.org/pdf/2405.07510
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion
Distillation (Stability AI Turbo Solution) 18 March 2024
The authors say that Adversarial Diffusion Distillation was a big step, but the use of the fixed, pretrained DINOv2 network restricts the
discriminator's training resolution to 518 × 518 pixels, and there is no straightforward way to control the feedback level of the discriminator.
Moreover, as Yann LeCun noted, the need to decode to RGB space is a problem. They also observe that smaller discriminator
feature networks often offer better performance than their larger counterparts.
They distill generative features of a pretrained diffusion model instead of DINOv2. By targeted sampling of the noise levels during training,
the discriminator features can be biased towards more global (high noise level) or more local (low noise level) behavior.
Many distillation techniques attempt to learn "simpler" differential equations that result in the same distribution at t = 0 but with
"straighter", more linear trajectories, which allows larger step sizes and therefore fewer evaluations of the network.
LADD introduces two main modifications: the unification of discriminator and teacher model, and the adoption of synthetic data for training.
They first generate an image with the teacher model, then add noise to the generated image, and then denoise it with both the
teacher and student networks, computing the loss over these latent-space representations.
They also feed the student's output to the teacher model; after each layer of the teacher model, which is a transformer, they add a discriminator
head and compute the adversarial loss with these heads.
In one-step scenarios, CFG simply oversaturates samples rather than improving text alignment. This observation suggests that CFG works
best in settings with multiple steps, which allow correcting oversaturation in most cases. They also find that while the distillation loss
benefits training with real data, it offers no advantage for synthetic data; thus, training on synthetic data can be effectively conducted using
only an adversarial loss.
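The re-noising of teacher-generated latents before discrimination can be sketched like this; the helper names and coefficient containers are illustrative:

```python
import numpy as np

def ladd_noise_target(latent, t, alphas, sigmas, rng):
    """Re-noise a teacher-generated latent to noise level t before scoring
    it with discriminator heads on the teacher's transformer features.
    Biasing the sampling of t toward high noise pushes the discriminator
    toward global structure; low noise biases it toward local detail.
    (`alphas`/`sigmas` are hypothetical schedule lookups.)"""
    eps = rng.standard_normal(latent.shape)
    return alphas[t] * latent + sigmas[t] * eps
```

Since everything stays in latent space, the expensive decode-to-RGB step that a pixel-space discriminator would require is avoided.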
https://arxiv.org/pdf/2403.12015
Distilling Diffusion Models into Conditional GANs 9 May 2024
https://arxiv.org/pdf/2405.05967
Cross-Attention Makes Inference Cumbersome
in Text-to-Image Diffusion Models 3 April 2024
https://arxiv.org/pdf/2404.02747
EdgeFusion: On-Device Text-to-Image Generation 18 April 2024
https://arxiv.org/pdf/2404.11925
Quality Enhancement
The following slides include:
- Align Your Steps, 22 April 2024
Align Your Steps, 22 April 2024
Sampling from DMs can be seen as solving a differential equation through a discretized set of noise levels known as the sampling schedule. They
propose a general and principled approach to optimizing the sampling schedules of DMs for high-quality outputs.
SDE solvers excel in sampling from diffusion models due to their built-in error-correction, allowing them to outperform ODE solvers.
https://arxiv.org/pdf/2404.14507
Face Models
The following slides include:
- InstantID: Zero-shot Identity-Preserving Generation in Seconds 2 Feb 2024
InstantID: Zero-shot Identity-Preserving Generation in Seconds
2 Feb 2024
They use a pre-trained face model to detect and extract a face ID embedding from the reference facial image, providing strong identity
features to guide the image generation process.
Image Adapter: a lightweight adaptive module with decoupled cross-attention is introduced to support images as prompts. However, they diverge
by employing the ID embedding as their image prompt, as opposed to the coarsely aligned CLIP embedding. This choice is aimed at achieving a more
nuanced and semantically rich prompt integration.
Directly adding the text and image tokens in cross-attention tends to weaken the control exerted by text tokens, so they adopt a ControlNet, named
IdentityNet. In this net, they use 5 facial landmarks instead of 68 (OpenPose), and instead of the text embedding they use ArcFace identity
embeddings in the cross-attention layers.
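The decoupled cross-attention of the image adapter can be sketched as follows: text tokens and ID-embedding tokens get separate attention branches whose outputs are summed, so image tokens do not dilute the text control. All names and shapes here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(q, k, v):
    """Plain scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decoupled_cross_attention(q, text_kv, id_kv, scale=1.0):
    """Decoupled cross-attention sketch: separate K/V projections for
    text tokens and ID-embedding tokens, with the two branch outputs
    summed (scale weights the identity branch)."""
    text_out = attn(q, *text_kv)
    id_out = attn(q, *id_kv)
    return text_out + scale * id_out
```

Setting `scale=0` recovers plain text-conditioned cross-attention, which is why this layout preserves text controllability while still injecting identity.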
https://arxiv.org/pdf/2401.07519
Video Generation
The following slides include:
- StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation 2 May 2024
StoryDiffusion: Consistent Self-Attention
for Long-Range Image and Video Generation 2 May 2024
https://arxiv.org/pdf/2405.01434